## here() starts at /home/rkisleva/ieo/ieoproject-bronchial-detectives

1 Introduction

The severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) is a highly pathogenic human zoonotic coronavirus, which causes Coronavirus disease 2019 (COVID-19). In an effort to understand the host transcriptional response to the SARS-Cov-2 virus, Blanco-Melo et al. (2020) sequenced the transcriptome of two different human cell lines, human alveolar adenocarcinoma cells (A549) and primary human bronchial epithelial (NHBE) cells, after infecting them with SARS-Cov-2, seasonal influenza A virus (IAV) and human respiratory syncytial virus (RSV), and growing them in the same culture conditions without infection (mock).

The resulting raw RNA-seq data have been deposited at the Gene Expression Omnibus (GEO), where they are publicly available under accession GSE147507. Here, we show a first exploratory analysis of the corresponding RNA-seq gene expression profiles generated as a table of counts using the DEE2 (https://dee2.io) pipeline by Ziemann, Kaspi, and El-Osta (2019), and further packaged into a SummarizedExperiment object with genes mapped to Entrez identifiers. This object also stores the phenotypic information about the profiled samples that has been also made available at GEO.

The directories in this package template follow the structure of an R package (consult this link for further information) adapted for the purpose of data analysis:

  • R: functions to avoid repeating steps and store auxiliary code, used in the vignette.

  • vignettes: Rmarkdown document of the data analysis.

  • inst/extdata: repository of the data to be analyzed and any other kind of additional functional and annotation data employed during the analysis and which is unavailable through an R package. This directory is moved to the package root directory at install.

  • inst/doc: repository for the results of the analysis that we want to provide without having to run the entire analysis again, e.g., tables of differentially expressed genes. This directory is moved to the package root directory at install, where also the vignettes are installed.

  • man: manual pages of functions defined in the R directory that for some reason we want to export from the package.

Every other directory you see in this package has been automatically generated in the package building process as, for instance, the doc directory with the result of the vignette.

When you edit and adapt this vignette for your own data analysis project, you should do it in the .Rmd file of the vignettes directory (not the one copied to the doc directory because this one is overwritten during package building and if you edit there you will loose your edits). To build this vignette without building the entire package, you should type the following in the R shell:

devtools::build_vignettes()

This function call will build your vignette and copy the resulting HTML to the doc directory. Thus, to see the result, you should go there and open that HTML file.

The rest of the documentation of this package is provided within the files of the R directory using roxygen2, which means that before you build the entire package you have to generate the documentation and NAMESPACE file typing in the R shell:

devtools::document()

Both steps, calling devtools::build_vignettes() and devtools::document() have to be done from the root directory of your package.

2 Quality assessment

2.1 Data import and cleaning

We start importing the raw table of counts.

library(SummarizedExperiment)

se <- readRDS(file.path(system.file("extdata",
                                    package="IEOproject"),
                        "GSE79209.rds"))
se
class: RangedSummarizedExperiment 
dim: 25122 81 
metadata(4): experimentData annotation ensemblVersion urlProcessedData
assays(1): counts
rownames(25122): 1 10 ... 9994 9997
rowData names(5): gene_id gene_biotype description gene_id_version
  symbol
colnames(81): SRR3225255 SRR3225256 ... SRR3225335 SRR3225336
colData names(68): title geo_accession ... total alignments:ch1 unique
  alignments:ch1

We have 25122 genes by 81 samples. From the first row and column names shown by the object, we can figure out that genes are defined by Entrez (Maglott et al. 2010) identifiers and samples by Sequence Read Archive Run (SRR) identifiers.

The row data in this object contains information about the profiled genes.

head(rowData(se))
DataFrame with 6 rows and 5 columns
                  gene_id   gene_biotype            description
              <character>    <character>            <character>
1         ENSG00000121410 protein_coding alpha-1-B glycoprote..
10        ENSG00000156006 protein_coding N-acetyltransferase ..
100       ENSG00000196839 protein_coding adenosine deaminase ..
1000      ENSG00000170558 protein_coding cadherin 2 [Source:H..
10000     ENSG00000117020 protein_coding AKT serine/threonine..
100008586 ENSG00000236362 protein_coding G antigen 12F [Sourc..
             gene_id_version      symbol
                 <character> <character>
1         ENSG00000121410.11        A1BG
10         ENSG00000156006.4        NAT2
100       ENSG00000196839.12         ADA
1000       ENSG00000170558.8        CDH2
10000     ENSG00000117020.16        AKT3
100008586  ENSG00000236362.8     GAGE12F

Among this information, the gene symbol and description are potentially useful for interpreting results of, for instance, a differential expression analysis. Let’s explore now the column (phenotypic) data.

dim(colData(se))
[1] 81 68
head(colData(se), n=3)
DataFrame with 3 rows and 68 columns
                            title geo_accession                status
                      <character>   <character>           <character>
SRR3225255  Subject L1, dysplasia    GSM2088002 Public on May 25 2017
SRR3225256 Subject L10, dysplasia    GSM2088003 Public on May 25 2017
SRR3225257 Subject L11, dysplasia    GSM2088004 Public on May 25 2017
           submission_date last_update_date        type channel_count
               <character>      <character> <character>   <character>
SRR3225255     Mar 14 2016      May 15 2019         SRA             1
SRR3225256     Mar 14 2016      May 15 2019         SRA             1
SRR3225257     Mar 14 2016      May 15 2019         SRA             1
              source_name_ch1 organism_ch1    characteristics_ch1
                  <character>  <character>            <character>
SRR3225255 Bronchial brushing Homo sapiens tissue: Bronchial br..
SRR3225256 Bronchial brushing Homo sapiens tissue: Bronchial br..
SRR3225257 Bronchial brushing Homo sapiens tissue: Bronchial br..
           characteristics_ch1.1 characteristics_ch1.2  characteristics_ch1.3
                     <character>           <character>            <character>
SRR3225255               age: 69             Sex: Male smoking status: Form..
SRR3225256               age: 50           Sex: Female smoking status: Curr..
SRR3225257               age: 49             Sex: Male smoking status: Curr..
           characteristics_ch1.4 characteristics_ch1.5  characteristics_ch1.6
                     <character>           <character>            <character>
SRR3225255        pack years: 58  copd status: No COPD max histology: Mild ..
SRR3225256        pack years: 34  copd status: No COPD max histology: Mild ..
SRR3225257        pack years: 35  copd status: No COPD max histology: Moder..
            characteristics_ch1.7  characteristics_ch1.8  characteristics_ch1.9
                      <character>            <character>            <character>
SRR3225255 dysplasia status: Dy.. total alignments: 91.. unique alignments: 8..
SRR3225256 dysplasia status: Dy.. total alignments: 96.. unique alignments: 8..
SRR3225257 dysplasia status: Dy.. total alignments: 12.. unique alignments: 1..
           characteristics_ch1.10 characteristics_ch1.11 characteristics_ch1.12
                      <character>            <character>            <character>
SRR3225255 read 1 alignment: 42.. read 2 alignment: 41.. positive strand alig..
SRR3225256 read 1 alignment: 45.. read 2 alignment: 43.. positive strand alig..
SRR3225257 read 1 alignment: 56.. read 2 alignment: 54.. positive strand alig..
           characteristics_ch1.13 characteristics_ch1.14 characteristics_ch1.15
                      <character>            <character>            <character>
SRR3225255 negative strand alig.. non-splice alignment.. splice alignments: 2..
SRR3225256 negative strand alig.. non-splice alignment.. splice alignments: 2..
SRR3225257 negative strand alig.. non-splice alignment.. splice alignments: 3..
           characteristics_ch1.16 characteristics_ch1.17 characteristics_ch1.18
                      <character>            <character>            <character>
SRR3225255 properly paired alig.. genebody 80/20 ratio.. mean gc content: 53.54
SRR3225256 properly paired alig.. genebody 80/20 ratio.. mean gc content: 45.45
SRR3225257 properly paired alig.. genebody 80/20 ratio.. mean gc content: 44.44
           molecule_ch1   extract_protocol_ch1 extract_protocol_ch1.1
            <character>            <character>            <character>
SRR3225255    polyA RNA Bronchial airway bru.. Sequencing libraries..
SRR3225256    polyA RNA Bronchial airway bru.. Sequencing libraries..
SRR3225257    polyA RNA Bronchial airway bru.. Sequencing libraries..
             taxid_ch1 description        data_processing
           <character> <character>            <character>
SRR3225255        9606   Sample_L1 Demultiplexing and c..
SRR3225256        9606  Sample_L10 Demultiplexing and c..
SRR3225257        9606  Sample_L11 Demultiplexing and c..
                data_processing.1      data_processing.2      data_processing.3
                      <character>            <character>            <character>
SRR3225255 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
SRR3225256 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
SRR3225257 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
                data_processing.4  data_processing.5      data_processing.6
                      <character>        <character>            <character>
SRR3225255 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
SRR3225256 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
SRR3225257 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
           platform_id data_row_count    instrument_model library_selection
           <character>    <character>         <character>       <character>
SRR3225255    GPL16791              0 Illumina HiSeq 2500              cDNA
SRR3225256    GPL16791              0 Illumina HiSeq 2500              cDNA
SRR3225257    GPL16791              0 Illumina HiSeq 2500              cDNA
           library_source library_strategy               relation
              <character>      <character>            <character>
SRR3225255 transcriptomic          RNA-Seq SRA: https://www.ncb..
SRR3225256 transcriptomic          RNA-Seq SRA: https://www.ncb..
SRR3225257 transcriptomic          RNA-Seq SRA: https://www.ncb..
                       relation.1 supplementary_file_1     age:ch1
                      <character>          <character> <character>
SRR3225255 BioSample: https://w..                 NONE          69
SRR3225256 BioSample: https://w..                 NONE          50
SRR3225257 BioSample: https://w..                 NONE          49
           copd status:ch1 dysplasia status:ch1 genebody 80/20 ratio:ch1
               <character>          <character>              <character>
SRR3225255         No COPD            Dysplasia         1.34213035203572
SRR3225256         No COPD            Dysplasia         1.25508564705863
SRR3225257         No COPD            Dysplasia         1.20885950984169
            max histology:ch1 mean gc content:ch1
                  <character>         <character>
SRR3225255     Mild dysplasia               53.54
SRR3225256     Mild dysplasia               45.45
SRR3225257 Moderate dysplasia               44.44
           negative strand alignments:ch1 non-splice alignments:ch1
                              <character>               <character>
SRR3225255                       41653617                  62567474
SRR3225256                       44222609                  64646471
SRR3225257                       55046872                  79307301
           pack years:ch1 positive strand alignments:ch1
              <character>                    <character>
SRR3225255             58                       41837114
SRR3225256             34                       44366147
SRR3225257             35                       55244102
           properly paired alignments:ch1 read 1 alignment:ch1
                              <character>          <character>
SRR3225255                       66570222             42319429
SRR3225256                       70579486             45257546
SRR3225257                       87409558             56124363
           read 2 alignment:ch1     Sex:ch1 smoking status:ch1
                    <character> <character>        <character>
SRR3225255             41171302        Male      Former smoker
SRR3225256             43331210      Female     Current smoker
SRR3225257             54166611        Male     Current smoker
           splice alignments:ch1         tissue:ch1 total alignments:ch1
                     <character>        <character>          <character>
SRR3225255              20923257 Bronchial brushing             91000281
SRR3225256              23942285 Bronchial brushing             96810722
SRR3225257              30983673 Bronchial brushing            120861113
           unique alignments:ch1
                     <character>
SRR3225255              83490731
SRR3225256              88588756
SRR3225257             110290974

We have a total of 68 phenotypic variables. The second column geo_accession contains GEO Sample Accession Number (GSM) identifers. GSM identifiers define individual samples, understood in our context as individual sources of RNA. We can check if we have any technical replicates as follows:

length(unique(se$geo_accession))
[1] 81
table(lengths(split(colnames(se), se$geo_accession)))

 1 
81 

So, we have 81 different individual samples which matches the number of samples that we have 81. Therefore we don’t have any technical replicates and we can proceed with the next step.

To proceed further exploring this dataset, we are going to use the edgeR package and build a DGEList object, incorporating the gene metadata, which includes the gene symbol.

library(edgeR)

dge <- DGEList(counts=assays(se)$counts, genes=rowData(se))
head(dge$samples, 3)
           group lib.size norm.factors
SRR3225255     1 35504068            1
SRR3225256     1 39395250            1
SRR3225257     1 49591671            1
dim(dge)
[1] 25122    81

Calculate \(\log_2\) CPM units of expression and put them as an additional assay element to ease their manipulation. This provides a standarized and normalized representation of gene expression data.

assays(se)$logCPM <- cpm(dge, log=TRUE)
assays(se)$logCPM[1:5, 1:5]
      SRR3225255 SRR3225256 SRR3225257 SRR3225258 SRR3225259
1     -2.8144890 -2.9017808 -4.1164076 -2.4911549  -3.436999
10    -1.8212210 -2.6511767 -3.6837191 -2.1459416  -1.077656
100   -0.0186939  0.9340788  0.2471703 -0.6559511   1.072922
1000   1.4206565  2.2413119  3.0040881  0.5618081   2.296503
10000  1.3426936  2.2490363  2.1152217  2.2466499   1.900561

Let’s explore now some of the phenotypic variables. After some visual inspection of the variables, we find out that the variable se$characteristics_ch1.4 contains information about how many cigarettes a person has smoked in their lifetime. We also have a variable which tells us whether the person has COPD (Chronic obstructive pulmonary disease) or not se$characteristics_ch1.5.

table(se$characteristics_ch1.4)

  pack years: 108   pack years: 114 pack years: 26.25    pack years: 30 
                1                 1                 1                 3 
 pack years: 31.5    pack years: 32  pack years: 32.5    pack years: 33 
                1                 2                 2                 1 
pack years: 33.25    pack years: 34    pack years: 35  pack years: 35.5 
                1                 2                 5                 1 
   pack years: 36  pack years: 36.4  pack years: 36.8    pack years: 37 
                1                 1                 1                 2 
pack years: 37.58    pack years: 38    pack years: 39    pack years: 40 
                1                 2                 1                 5 
   pack years: 41 pack years: 41.49    pack years: 42  pack years: 42.5 
                3                 1                 2                 1 
pack years: 42.66 pack years: 43.75    pack years: 45    pack years: 46 
                1                 1                 2                 1 
 pack years: 46.8    pack years: 47    pack years: 48  pack years: 49.5 
                1                 2                 1                 2 
   pack years: 50  pack years: 50.6    pack years: 51 pack years: 51.25 
                2                 2                 4                 1 
   pack years: 52    pack years: 53    pack years: 54    pack years: 57 
                1                 3                 2                 2 
   pack years: 58 pack years: 58.24  pack years: 60.8    pack years: 65 
                2                 1                 1                 2 
 pack years: 65.7    pack years: 69    pack years: 76    pack years: 77 
                1                 1                 1                 1 
   pack years: 90 
                1 
table(se$characteristics_ch1.5)

   copd status: COPD copd status: No COPD 
                  24                   57 

Another variable of interest could be max histology se$characteristics_ch1.6 which represents the different histology types (tissue characteristics).

table(se$characteristics_ch1.6)

       max histology: Hyperplasia         max histology: Metaplasia 
                               13                                 7 
    max histology: Mild dysplasia max histology: Moderate dysplasia 
                               35                                11 
            max histology: Normal   max histology: Severe dysplasia 
                               12                                 3 

To facilitate handling this variables we are going to perform the following modifications:

se$copd_status <- gsub("copd status: ", "", se$characteristics_ch1.5)
se$copd_status <- factor(se$copd_status, levels = c("COPD", "No COPD"))

se$histology <- gsub("max histology: ", "", se$characteristics_ch1.6)
se$histology <- factor(se$histology, levels(se$histology) <- c("Hyperplasia", "Metaplasia", "Mild dysplasia", "Moderate dysplasia", "Normal", "Severe dysplasia"))

To faciliate working with the pack years variable will split it into four categorical groups - Low, Medium, High and Very high.

To do that we will: 1. Calculate quartiles for the data.
2. Assign categories based on quartile ranges.
3. Convert the values into their corresponding categories.

se$characteristics_ch1.4 <- gsub("pack years: ", "", se$characteristics_ch1.4)
class(se$characteristics_ch1.4)
[1] "character"
se$characteristics_ch1.4 <- as.numeric(se$characteristics_ch1.4)

quartiles <- quantile(se$characteristics_ch1.4, probs = c(0.25, 0.5, 0.75))

Now that we have the quartiles, we can proceed to assign categories to the values based on these quartiles.

Values below the first quartile (Q1) will be categorized as “low.” Values between Q1 and the median (Q2) will be categorized as “medium.” Values between the median (Q2) and the third quartile (Q3) will be categorized as “high.” Values above the third quartile (Q3) will be categorized as “very high.”


se$pack_years <- cut(se$characteristics_ch1.4,
                          breaks = c(-Inf, quartiles[1], quartiles[2], quartiles[3], Inf),
                          labels = c("Low", "Medium", "High", "Very High"),
                          include.lowest = TRUE)

In Table 1 below, we show the three variables of interest (histology, COPD status and pack years) jointly to gather as much understanding as possible on the underlying experimental design.

Table 1: Table 2: Phenotypic variables.
Identifer Pack Years Histology COPD Status
SRR3225255 Very High Mild dysplasia No COPD
SRR3225256 Low Mild dysplasia No COPD
SRR3225257 Low Moderate dysplasia No COPD
SRR3225258 Medium Mild dysplasia No COPD
SRR3225259 Medium Mild dysplasia No COPD
SRR3225260 Very High Moderate dysplasia COPD
SRR3225261 High Mild dysplasia COPD
SRR3225262 Very High Moderate dysplasia No COPD
SRR3225263 Low Mild dysplasia COPD
SRR3225264 High Mild dysplasia COPD
SRR3225265 Medium Moderate dysplasia COPD
SRR3225266 Low Hyperplasia No COPD
SRR3225267 Medium Hyperplasia COPD
SRR3225268 Very High Mild dysplasia No COPD
SRR3225269 Very High Mild dysplasia COPD
SRR3225270 Low Hyperplasia COPD
SRR3225271 Medium Mild dysplasia No COPD
SRR3225272 High Normal No COPD
SRR3225273 Medium Hyperplasia COPD
SRR3225274 Very High Normal No COPD
SRR3225275 Low Hyperplasia No COPD
SRR3225276 High Mild dysplasia COPD
SRR3225277 Medium Mild dysplasia No COPD
SRR3225278 Medium Normal No COPD
SRR3225279 Very High Hyperplasia No COPD
SRR3225280 Medium Normal No COPD
SRR3225281 Very High Hyperplasia No COPD
SRR3225282 Very High Moderate dysplasia No COPD
SRR3225283 Low Mild dysplasia No COPD
SRR3225284 High Moderate dysplasia COPD
SRR3225285 Very High Severe dysplasia COPD
SRR3225286 Medium Normal No COPD
SRR3225287 Medium Moderate dysplasia No COPD
SRR3225288 Medium Moderate dysplasia No COPD
SRR3225289 Very High Normal No COPD
SRR3225290 High Normal No COPD
SRR3225291 Very High Normal COPD
SRR3225292 Medium Mild dysplasia No COPD
SRR3225293 Very High Mild dysplasia No COPD
SRR3225295 Low Mild dysplasia No COPD
SRR3225296 High Mild dysplasia COPD
SRR3225297 Low Normal No COPD
SRR3225298 Very High Hyperplasia No COPD
SRR3225299 Low Mild dysplasia No COPD
SRR3225300 High Moderate dysplasia No COPD
SRR3225301 Medium Normal No COPD
SRR3225302 High Normal No COPD
SRR3225303 High Mild dysplasia No COPD
SRR3225304 Medium Hyperplasia No COPD
SRR3225305 Low Mild dysplasia No COPD
SRR3225306 Low Mild dysplasia No COPD
SRR3225307 High Mild dysplasia No COPD
SRR3225308 High Hyperplasia No COPD
SRR3225309 Low Hyperplasia No COPD
SRR3225310 Very High Mild dysplasia COPD
SRR3225311 Medium Moderate dysplasia COPD
SRR3225312 High Mild dysplasia No COPD
SRR3225313 Low Mild dysplasia No COPD
SRR3225314 Low Hyperplasia No COPD
SRR3225315 Medium Moderate dysplasia COPD
SRR3225316 Low Mild dysplasia No COPD
SRR3225317 High Normal No COPD
SRR3225318 Low Mild dysplasia No COPD
SRR3225319 High Mild dysplasia COPD
SRR3225320 Very High Hyperplasia COPD
SRR3225321 High Severe dysplasia No COPD
SRR3225322 Very High Mild dysplasia No COPD
SRR3225323 High Mild dysplasia COPD
SRR3225324 Very High Mild dysplasia No COPD
SRR3225325 Low Mild dysplasia No COPD
SRR3225326 Medium Mild dysplasia COPD
SRR3225327 High Severe dysplasia COPD
SRR3225328 Very High Metaplasia COPD
SRR3225329 High Metaplasia No COPD
SRR3225330 Very High Metaplasia No COPD
SRR3225331 Medium Metaplasia No COPD
SRR3225332 Medium Mild dysplasia No COPD
SRR3225333 Low Metaplasia No COPD
SRR3225334 Low Metaplasia No COPD
SRR3225335 High Metaplasia COPD
SRR3225336 Low Mild dysplasia No COPD

2.2 Sequencing depth

In the following step we will evaluate the sequencing depth in terms of the total number of read counts that are mapped to each sample’s genome. The sequencing depth per sample, commonly referred to as library sizes, is displayed in Figure 1 below in increasing order.

Library sizes in increasing order.

Figure 1: Library sizes in increasing order

The plot displays a range of sequencing depths for the samples, each represented by an SRR identifier. The sequencing depth across our samples ranges from 20 to 50 million reads, with the majority falling between 30 to 40 million reads. This indicates a generally high level of coverage, which is beneficial for accurate genomic analysis. The plot also suggests that the highest number of reads corresponds to mild and moderate dysplasia, indicating early stages of cellular abnormalities that could potentially progress to cancer. The predominance of mild dysplasia in these samples could be of particular interest for early detection and intervention studies. There does not appear to be any evident bias in terms of sequencing depth across the conditions represented.

2.3 Distribution of expression levels among samples

Figure 2 below shows the distribution of expression values per sample in logarithmic CPM units of expression.

Non-parametric density distribution of expression profiles per sample.

Figure 2: Non-parametric density distribution of expression profiles per sample

There are no substantial differences between the samples in the distribution of expression values.

2.4 Distribution of expression levels among genes

Let’s calculate now the average expression per gene through all the samples. Figure 3 shows the distribution of those values across genes.

Distribution of average expression level per gene.

Figure 3: Distribution of average expression level per gene

The histogram displays a bimodal distribution, which means there are two distinct peaks. The large peak to the left is the most prominent feature of the histogram, showing that most genes in this dataset are expressed at low levels or possibly not at all, under the conditions of the experiment. On the right part, the peak which indicates that as gene expression levels increase, fewer genes are found at these higher expression levels.

2.5 Filtering of lowly-expressed genes

In order to filter lowly-expressed genes, we calculate row means of expression data and then we apply logical masks. We filter out genes which do not meet certain criteria such as minimum expression levels and a sample threshold. This step is needed to refine the dataset and remove noise, ensuring the subsequent analysis is based on high-quality data. Figure 4 shows the distribution of those values across genes.

mask <- rowMeans(assays(se)$logCPM) > 1
  se.filt <- se[mask, ]
  dge.filt <- dge[mask, ]
  dim(se.filt)
[1] 13374    81
  
cpmcutoff <- round(10/min(dge$sample$lib.size/1e6), digits=1)
cpmcutoff
[1] 0.4

nsamplescutoff <- 5
nsamplescutoff
[1] 5
mask <- avgexp >= 4 & avgexp <= 10
se.filt <- se[mask, ]
dge.filt <- dge[mask, ]
dim(se.filt)
[1] 8930   81
Distribution of lowly-expressed genes.

Figure 4: Distribution of lowly-expressed genes

We are left with 8930 genes.

2.6 Normalization

We calculate now the normalization factors on the filtered expression data set.

dge.filt <- calcNormFactors(dge.filt)

Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.

assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE,
                              normalized.lib.sizes=TRUE)

2.7 MA-plots

We examine now the MA-plots of the normalized expression profiles in Figure ??

MA-plots of filtered and normalized expression values.

Figure 5: MA-plots of filtered and normalized expression values

MA-plots of filtered and normalized expression values.

Figure 6: MA-plots of filtered and normalized expression values

MA-plots of filtered and normalized expression values.

Figure 7: MA-plots of filtered and normalized expression values

MA-plots of filtered and normalized expression values.

Figure 8: MA-plots of filtered and normalized expression values

MA-plots of filtered and normalized expression values.

Figure 9: MA-plots of filtered and normalized expression values

For genes above the horizontal line at M=0, expression is higher in the condition being tested against the reference, as for genes below, expression is lower. The red line represents a loess fit to the data, which should ideally be around M=0 if there is no differential expression between conditions. Any systematic deviation from M=0 indicates potential bias or systematic variation in the data. In most of the plots, the red line remains close to M=0 across the range of A values, indicating no systematic bias in expression changes relative to gene abundance. Some plots show a slight dip in the loess line at the lower end of the A scale, suggesting a potential bias for lowly expressed genes which could be potential candidates for differential expression. These points would be of interest for follow-up analyses.

2.8 Experimental design and batch identification

To identify potential columns which could introduce variability to batch, we looked at columns that contain information related to technical aspects, experimental conditions or processing details that could vaey between different batches. Here are some columns which we though could introduce unwanted variability: submission_date, last_update_date, type, channel_count, extract_protocol_ch1, extract_protocol_ch1.1, data_processing, data_processing.1, data_processing.2, data_processing.3, data_processing.4, data_processing.5, data_processing.6, platform_id, instrument_model, library_selection, library_source, library_strategy.

These variables have the same values across all samples which suggests that they are not contributing to the variability of the dataset. For example:

table(se.filt$histology, se.filt$library_strategy)
                    
                     RNA-Seq
  Hyperplasia             13
  Metaplasia               7
  Mild dysplasia          35
  Moderate dysplasia      11
  Normal                  12
  Severe dysplasia         3

There is perfect correlation between library strategy and histology. This means that differences between our samples will be due to biological differences rather than technical.

Now let’s examine the distribution between pack-years and histology.

table(se.filt$pack_years, se.filt$histology)
           
            Hyperplasia Metaplasia Mild dysplasia Moderate dysplasia Normal
  Low                 5          2             12                  1      1
  Medium              3          1              7                  5      4
  High                1          2              9                  2      4
  Very High           4          2              7                  3      3
           
            Severe dysplasia
  Low                      0
  Medium                   0
  High                     2
  Very High                1

We can see that not all combinations of pack-years and histology have been sequenced. Some combinations have missing values, indicated by 0s. However, with only two missing combinations out of the several possible combinations, the impact on the overall analysis may be minimal and not critical for the analysis.

We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating pack-years and histology. We calculate again log CPM values with a high prior count(3) to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 10.

Figure S6: Hierarchical clustering of the samples. Labels correspond to pack-years and sample identifer, while colors indicate histology groups.

Figure 10: Figure S6: Hierarchical clustering of the samples
Labels correspond to pack-years and sample identifer, while colors indicate histology groups.

Samples are linked together based on their similarity, with the branches connecting those most similar first. As we move up the tree, the connections represent progressively less similar groups until all are joined in a single cluster. Samples do not really cluster together by histology type, which means histology type is not the primary factor driving the similarity or dissimilarity between samples.

In Figure 11 we show the corresponding MDS plot.

Figure S7: Multidimensional scaling plot of the samples. Labels correspond to treatment and colors indicate sample group.

Figure 11: Figure S7: Multidimensional scaling plot of the samples
Labels correspond to treatment and colors indicate sample group.

3 Differential expression

We can do an analysis to observe differential expression based on the two following groupings:

Histology

We can separate the samples’ Max. Histology in two groups: Normal vs. Dysplasia.

The Normal group will contain all samples that are Normal or have Hyperplasia.

The Dysplasia group will contain all samples with Dysplasia: Mild Dysplasia, Moderate dysplasia and Severe dysplasia

COPD

(Chronic obstructive pulmonary disease)

Samples with COPD vs. samples without COPD

3.1 Analysing Fold-Change

In first instance, a simple way to find Differentially Expressed genes is to simply compare expression levels between the two groups of samples.

We can do this using fold-change, which compares the mean read counts using the logarithmic scale to try and stabilize variability across expression levels.

3.1.1 COPD

Our two groups of samples are : samples with COPD and samples with No COPD. We calculate the means of the values of the expression levels.

copd <- rowMeans(logCPM[, se.filt$copd_status=="COPD"])
no_copd <- rowMeans(logCPM[, se.filt$copd_status=="No COPD"])

Now we can build two plots to compare fold-change between the two groups.

par(mfrow=c(1, 2))
plot(copd, no_copd, xlab="COPD", ylab="No COPD", pch=".", cex=4, las=1)
plot((copd+no_copd)/2, copd-no_copd, xlab="(COPD + No COPD)/2", ylab="COPD - No COPD", pch=".", cex=4, las=1)

3.1.2 Histology

Our two groups of samples are No Dysplasia and Dysplasia. We create a new column in our SE to dichotomize the histology. We assign “Normal”, “Hyperplasia” and “Metaplasia” to the No Dysplasia group, and we assign “Mild dysplasia”, “Moderate dysplasia”, and “Severe dysplasia” to the Dysplasia group

se.filt$dysplasia[se.filt$histology == "Normal"] <- "No Dysplasia"
se.filt$dysplasia[se.filt$histology == "Hyperplasia"] <- "No Dysplasia"
se.filt$dysplasia[se.filt$histology == "Metaplasia"] <- "No Dysplasia"
se.filt$dysplasia[se.filt$histology == "Mild dysplasia"] <- "Dysplasia"
se.filt$dysplasia[se.filt$histology == "Moderate dysplasia"] <- "Dysplasia"
se.filt$dysplasia[se.filt$histology == "Severe dysplasia"] <- "Dysplasia"

#se.filt.dysplasia <- se.filt[, !is.na(se.filt$dysplasia)]
se.filt.dysplasia <- se.filt

no_dysplasia <- rowMeans(logCPM[, se.filt.dysplasia$dysplasia=="No Dysplasia"])
dysplasia <- rowMeans(logCPM[, se.filt.dysplasia$dysplasia=="Dysplasia"])

Now we can build two plots to compare fold-change between the two groups.

par(mfrow=c(1, 2))
plot(no_dysplasia, dysplasia, xlab="No Dysplasia", ylab="Dysplasia", pch=".", cex=4, las=1)
plot((no_dysplasia+dysplasia)/2, no_dysplasia-dysplasia, xlab="(No Dysplasia + Dysplasia)/2", ylab="No Dysplasia - Dysplasia", pch=".", cex=4, las=1)

3.2 Performing Differential Expression (DE)

We perform a simple assessment of the extent of expression changes and their associated p-values using the F-test implemented in the R/Bioconductor package sva. We compare two groups of samples in the following ways.

3.2.1 COPD

We compare samples that have COPD with samples that don’t have COPD. We have previously subsetted the data the way we needed.

We build now the corresponding full and null model matrices.

mod <- model.matrix(~ se.filt$copd_status, colData(se.filt))
mod0 <- model.matrix(~ 1, colData(se.filt))

Finally, we conduct the F-test implemented in the package sva and examine the amount of differential expression between samples with COPD and without.

library(sva)
library(limma)

pv <- f.pvalue(assays(se.filt)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.05)
[1] 8

We obtain 8 differentially expressed (DE) genes at FDR < 5%. In Figure 12 below we can see the distribution of the resulting p-values.

Distribution of raw p-values for an F-test on every gene between samples with COPD and samples without COPD.

Figure 12: Distribution of raw p-values for an F-test on every gene between samples with COPD and samples without COPD

We build a table with the subset of 29 DE genes with FDR < 10% and show the top-10 genes with lowest p-value in Table 3 below.

mask <- p.adjust(pv, method="fdr") < 0.5
DEgenesEGs <- names(pv)[mask]
DEgenesSyms <- mcols(se.filt)[DEgenesEGs, "symbol"]
DEgenesPvalue <- pv[mask]
DEgenesDesc <- mcols(se.filt)[DEgenesEGs, "description"]
DEgenesDesc <- sub(" \\[.+\\]", "", DEgenesDesc)
DEgenesTab <- data.frame(EntrezID=DEgenesEGs,
                         Symbol=DEgenesSyms,
                         Description=DEgenesDesc,
                         "P value"=DEgenesPvalue,
                         stringsAsFactors=FALSE, check.names=FALSE)
DEgenesTab <- DEgenesTab[order(DEgenesTab[["P value"]]), ] ## order by p-value
rownames(DEgenesTab) <- 1:nrow(DEgenesTab)
Table 3: Differentially expressed genes. Top-10 differentially expressed genes with lowest p-value between samples with COPD and without COPD with FDR < 5%. To see the full list of DE genes, follow this link or download this CSV file.
EntrezID Symbol Description P value
1 9875 URB1 URB1 ribosome biogenesis 1 homolog (S. cerevisiae) 3.00e-06
2 865 CBFB core-binding factor beta subunit 9.10e-06
3 134553 C5orf24 chromosome 5 open reading frame 24 9.60e-06
4 2957 GTF2A1 general transcription factor IIA subunit 1 1.54e-05
5 6815 STYX serine/threonine/tyrosine interacting protein 1.77e-05
6 26224 FBXL3 F-box and leucine rich repeat protein 3 2.01e-05
7 55432 YOD1 YOD1 deubiquitinase 3.11e-05
8 7257 TSNAX translin associated factor X 4.22e-05
9 7321 UBE2D1 ubiquitin conjugating enzyme E2 D1 5.27e-05
10 5532 PPP3CB protein phosphatase 3 catalytic subunit beta 6.65e-05

3.2.2 Histology

We compare samples that have Dysplasia with samples that don’t have Dysplasia. We have previously subsetted the data the way we needed.

We build now the corresponding full and null model matrices.

mod <- model.matrix(~ se.filt.dysplasia$dysplasia, colData(se.filt.dysplasia))
mod0 <- model.matrix(~ 1, colData(se.filt.dysplasia))

Finally, we conduct the F-test implemented in the package sva and examine the amount of differential expression between samples with Dysplasia and samples with Hyperplasia/Normal samples.

library(sva)
library(limma)

pv <- f.pvalue(assays(se.filt.dysplasia)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.05)
[1] 2579

We obtain 2579 differentially expressed (DE) genes at FDR < 5% and 3399 at FDR < 10%. In Figure 13 below we can see the distribution of the resulting p-values.

[1] 2579
Distribution of raw p-values for an F-test on every gene between samples with Dysplasia and samples with Hyperplasia/Normal samples.

Figure 13: Distribution of raw p-values for an F-test on every gene between samples with Dysplasia and samples with Hyperplasia/Normal samples

We build a table with the subset of 2579 DE genes with FDR < 5% and show the top-10 genes with lowest p-value in Table 4 below.

mask <- p.adjust(pv, method="fdr") < 0.05
DEgenesEGs <- names(pv)[mask]
DEgenesSyms <- mcols(se.filt.dysplasia)[DEgenesEGs, "symbol"]
DEgenesPvalue <- pv[mask]
DEgenesDesc <- mcols(se.filt.dysplasia)[DEgenesEGs, "description"]
DEgenesDesc <- sub(" \\[.+\\]", "", DEgenesDesc)
DEgenesTab <- data.frame(EntrezID=DEgenesEGs,
                         Symbol=DEgenesSyms,
                         Description=DEgenesDesc,
                         "P value"=DEgenesPvalue,
                         stringsAsFactors=FALSE, check.names=FALSE)

DEgenesTab <- DEgenesTab[order(DEgenesTab[["P value"]]), ] ## order by p-value
rownames(DEgenesTab) <- 1:nrow(DEgenesTab)
Table 4: Differentially expressed genes. Top-10 differentially expressed genes with lowest p-value between samples with Dysplasia and samples with No Dysplasia, with FDR < 5%. To see the full list of DE genes, follow this link or download this CSV file.
EntrezID Symbol Description P value
1 29123 ANKRD11 ankyrin repeat domain 11 0e+00
2 26505 CNNM3 cyclin and CBS domain divalent metal cation transport mediator 3 0e+00
3 23334 SZT2 SZT2, KICSTOR complex subunit 1e-07
4 5339 PLEC plectin 2e-07
5 23524 SRRM2 serine/arginine repetitive matrix 2 3e-07
6 79364 ZXDC ZXD family zinc finger C 3e-07
7 57154 SMURF1 SMAD specific E3 ubiquitin protein ligase 1 4e-07
8 9667 SAFB2 scaffold attachment factor B2 4e-07
9 4926 NUMA1 nuclear mitotic apparatus protein 1 5e-07
10 6170 RPL39 ribosomal protein L39 7e-07

4 Functional analysis

Here we will do the functional analysis.

5 Discussion

Here we discuss the findings.

6 Conclusions

Here we summarize our conclusions.

7 Session information

sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Oslo
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] sva_3.50.0                  BiocParallel_1.36.0        
 [3] genefilter_1.84.0           mgcv_1.9-1                 
 [5] nlme_3.1-163                geneplotter_1.80.0         
 [7] annotate_1.80.0             XML_3.99-0.16.1            
 [9] AnnotationDbi_1.64.1        lattice_0.22-5             
[11] edgeR_4.0.16                limma_3.58.1               
[13] SummarizedExperiment_1.32.0 Biobase_2.62.0             
[15] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
[17] IRanges_2.36.0              S4Vectors_0.40.2           
[19] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
[21] matrixStats_1.3.0           usethis_2.2.3              
[23] here_1.0.1                  kableExtra_1.4.0           
[25] knitr_1.46                  BiocStyle_2.30.0           

loaded via a namespace (and not attached):
 [1] viridisLite_0.4.2       blob_1.2.4              Biostrings_2.70.3      
 [4] bitops_1.0-7            fastmap_1.1.1           RCurl_1.98-1.14        
 [7] digest_0.6.35           lifecycle_1.0.4         survival_3.5-8         
[10] statmod_1.5.0           KEGGREST_1.42.0         RSQLite_2.3.6          
[13] magrittr_2.0.3          compiler_4.3.3          rlang_1.1.3            
[16] sass_0.4.9              tools_4.3.3             utf8_1.2.4             
[19] yaml_2.3.8              S4Arrays_1.2.1          bit_4.0.5              
[22] DelayedArray_0.28.0     xml2_1.3.6              RColorBrewer_1.1-3     
[25] abind_1.4-5             KernSmooth_2.23-22      purrr_1.0.2            
[28] grid_4.3.3              fansi_1.0.6             xtable_1.8-4           
[31] colorspace_2.1-0        scales_1.3.0            tinytex_0.50           
[34] cli_3.6.2               rmarkdown_2.26          crayon_1.5.2           
[37] rstudioapi_0.16.0       httr_1.4.7              DBI_1.2.2              
[40] cachem_1.0.8            stringr_1.5.1           splines_4.3.3          
[43] zlibbioc_1.48.2         parallel_4.3.3          BiocManager_1.30.22    
[46] XVector_0.42.0          vctrs_0.6.5             Matrix_1.6-5           
[49] jsonlite_1.8.8          bookdown_0.39           bit64_4.0.5            
[52] systemfonts_1.0.6       magick_2.8.3            locfit_1.5-9.9         
[55] jquerylib_0.1.4         glue_1.7.0              codetools_0.2-19       
[58] stringi_1.8.3           munsell_0.5.1           tibble_3.2.1           
[61] pillar_1.9.0            htmltools_0.5.8.1       GenomeInfoDbData_1.2.11
[64] R6_2.5.1                rprojroot_2.0.4         evaluate_0.23          
[67] highr_0.10              png_0.1-8               memoise_2.0.1          
[70] bslib_0.7.0             Rcpp_1.0.12             svglite_2.1.3          
[73] SparseArray_1.2.4       xfun_0.43               fs_1.6.3               
[76] pkgconfig_2.0.3        

References

Blanco-Melo, Daniel, Benjamin E. Nilsson-Payant, Wen-Chun Liu, Skyler Uhl, Daisy Hoagland, Rasmus Møller, Tristan X. Jordan, et al. 2020. “Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19.” Cell 181: 1036–45. https://doi.org/10.1016/j.cell.2020.04.026.
Maglott, Donna, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. 2010. “Entrez Gene: Gene-Centered Information at NCBI.” Nucleic Acids Research 39 (suppl_1): D52–57. https://doi.org/10.1093/nar/gkq1237.
Ziemann, Mark, Antony Kaspi, and Assam El-Osta. 2019. “Digital Expression Explorer 2: A Repository of Uniformly Processed RNA Sequencing Data.” GigaScience 8 (4): giz022. https://doi.org/10.1093/gigascience/giz022.